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ABSTRACT 

Motivation: Nucleosomes are the basic elements of chromatin 
structure. They control the packaging of DNA and play a critical 
role in gene regulation by allowing physical access to transcription 
factors. The advent of second-generation sequencing has enabled 
landmark genome-wide studies of nucleosome positions for several 
model organisms. Current methods to determine nucleosome 
positioning first compute an occupancy coverage profile by mapping 
nucleosome-enriched sequenced reads to a reference genome; 
then, nucleosomes are placed according to the peaks of the 
coverage profile. These methods are quite accurate on placing 
isolated nucleosomes, but they do not properly handle more 
complex configurations. Also, they can only provide the positions 
of nucleosomes and their occupancy level, whereas it is very 
beneficial to supply molecular biologists additional information about 
nucleosomes like the probability of placement, the size of DNA 
fragments enriched for nucleosomes and/or whether nucleosomes 
are well positioned or 'fuzzy' in the sequenced cell sample. 
Results: We address these issues by providing a novel method 
based on a parametric probabilistic model. An expectation 
maximization algorithm is used to infer the parameters of the mixture 
of distributions. We compare the performance of our method on two 
real datasets against Template Filtering, which is considered the 
current state-of-the-art. On synthetic data, we show that our method 
can resolve more accurately complex configurations of nucleosomes, 
and it is more robust to user-defined parameters. On real data, 
we show that our method detects a significantly higher number of 
nucleosomes. 

Availability: Visit http://www.cs.ucr.edu/~polishka 
Contact: stelo@cs.ucr.edu or polishka@cs.ucr.edu 



1 INTRODUCTION 

The study of the processes governing gene regulation is a 
central problem in molecular biology. One of the key factors 
influencing gene expression is the complex interaction between 
chromatin structure and transcription factors. The fundamental unit 
of chromatin is the nucleosome, composed of 146 ± 1 bp of DNA 
wrapped 1.65 turns around a protein complex of eight histones. 
To elucidate the role of the interactions between chromatin and 
transcription factors, it is crucial to determine the location of the 
nucleosomes along the genome. In general, the more condensed 
the chromatin, the harder it is for transcription factors and other 
DNA binding proteins to access DNA and carry out their tasks. 
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The more accessible is the DNA, the more likely surrounding 
genes are actively transcribed. The presence (or the absence) of 
nucleosomes directly or indirectly affects a variety of other cellular 
and metabolic processes like recombination, replication, centromere 
formation and DNA repair. 

A handful of experimental techniques have been developed for 
genome- wide mapping of nucleosomes. For instance, one can 
enrich for genomic regions that are either bound to histones 
(typically via chromatin immuno-precipitation or ChIP) or for 
genomic regions that are free of nucleosomes (linkers). For in stance, 
MAIN Es (MNase-assisted Isolation Nucleosomal Elements) jZaretl 
120051) isolates the portions of the DNA that are attached to 
nucleosomes, because MNase preferentially digests linker regions. 
Then, microarrays (ChlP-chip) or sequencing (ChlP-seq/MNase- 
seq) are applied to the enriched DNA. 

In this article, we concentrate on the analysis of sequencing 
data, given the prevalence of MNase/ChiP-seq experiments in the 
recent literature. The computational analysis of the sequencing data 
usually consists of two main steps: (i) a nucleosome occupancy 
coverage is computed from the process of mapping nucleosome- 
enriched sequenced reads to a reference genome, followed by some 
normalization steps and (ii) nucleosomes are placed according to the 
peaks of the coverage profile. 

Approaches based on peak calling are computationally fast and 
quite accurate in resolving isolated (stable or arrayed) nucleosomes; 
however; they are not entirely reliable when more complex 
nucleosome configurations are present. Observe that while it is 
physically impossible for two nucleosomes to be 'overlapping' on 
the same location on a DNA strand, it is quite common that the 
population of cells from which the enriched DNA was obtained 
had nucleosomes slightly 'off- sync' at a given genomic coordinate 
(transient interaction). As a consequence, the resulting coverage 
profile will exhibit a 'blurring' of the peaks. 

Molecular biologists distinguish the case of 'overlapping' 
nucleosome from 'fuzzy' nucleosomes or 'fuzzy' regions (see 
Fig.[TJ. For overlapping nucleosomes, the overlap is relatively small; 
in the 'fuzzy' case, several nucleosomes ar e mutually overlapping 
for a significant fraction of their size (see lZhang and Pughl 120111 
for a review). By introducing a threshold parameter on the allowed 
overlap, one can define precisely the line between 'fuzzy' and 
'overlapping'. 

Another shortcoming of peak-calling approaches is that they 
can only report nucleosome positions and/or occupancy level. 
Molecular biologists, however, need additional information about 
nucleosomes. For instance, they are interested in the level of 
'fuzziness' in certain genomic locations with respect to coding 
regions (i.e. well positioned for all the cells in the sample, or 
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NORMAL: Accurate Positioning of Nucleosomes 




Fig. 1. Nucleosomes are represented by ovals, mapped reads by arrows 
(which correspond to 5 ; -> 3 f prefixes of nucleosome-bound DNA). Coverage 
profiles are represented as time series for forward (top) and reverse (bottom) 
strands: the line style (solid, dotted and dashed) indicates peaks originating 
from distinct nucleosomes. (a) represents a stable nucleosome; (b) illustrates 
overlapping nucleosomes; and (c) represents 'fuzzy' nucleosomes 

'blurred'), or how strong is the binding between nucleosomes and 
DNA. To address these shortcomings, we propose a method that 
determines the accurate position of the nucleosomes independently 
from the amount of overlaps in the nucleosomes. This method can 
also extract other important statistics about nucleosomes, e.g. the 
probability that a nucleosome is actually present, a measure of 
nucleosome 'fuzziness', and the size of DNA fragments enriched 
for nucleosomes. 

Here, we propose a parametric probabilistic model for 
nucleosome positioning, which we called NOrMAL, for 
NucleOsome Mapping ALgorithm. NOrMAL uses Expectation 
Maximization (EM) to infer its parameters. To demonstrate the 
performance of our method, we report expe rimental results on 
MAINE- seq data for Plasmodi um falciparum dPonts et a/.Ll201Gb 
and Saccharomyces cerevisiae dWeiner et all l2010h . We compare 
the performance of our method again st the Template Filtering 
(TF) algorithm dWeiner et jZj l2010h . which is considered the 
current state-of-the-art in terms of accuracy and ability to estimate 
sizes of the DNA fragments bound to nucleosomes. We also discuss 
a fundamental limitation of greedy peak-calling approaches in the 
case of overlapping nucleosomes and how our method addresses 
this issue. 

1.1 Previous work 

Several landmark studies have been published in the last few 



years on the chromatin structure of model organisms basec 


on the 


analysis of genome-wide nucleosome maps dAlbert et al. 




2007; 


Field et al. 120081: iMavrich et all 12008a. bl; IPonts et al 




2010; 


Shivaswamv et all 120081: IValouev et all 120081: IZhang and Push 


2011). Existing methods in the literature are based on the analvsis 



of the peaks in the nucleosome occupancy coverages estimated by 
mapping nucleosome-enriched reads to the reference genome. The 
coverage occupancy profile is an integer-valued function defined 
for all genomic locations: given a position i in a chromosome 
the function is equal to the number of sequenced reads that are 
mapped to location i. From a probabilistic point of view, the coverage 
profile represents a non-parametric distribution of the nucleosome 
positions. At the time of writing, the length of the reads obtained by 
second-generation sequencing (e.g. Illumina Genome Analyzer) are 



limited to about 100 bases and the sequencing occurs in the 5 f -> 3 r 
direction. In the case of ChlP-seq/MAINE-seq, sequenced reads that 
can be uniquely mapped to the positive strand originate from the 
left boundary of nucleosome DNA fragments, while reads uniquely 
mapped to the negative strand originate from the right boundary 
(Fig-[D- Recall that nucleosomes are composed of about 146 bp of 
DNA, so if reads are single end and shorter than 146 bp, we expect 
to observe a peak in the forward and a peak in the reverse coverage 
profiles at a distance consistent with the nucleosome size. 

The problem of associating a peak in the forward strand with 
the correct peak in the negative strand can be difficult in the case 
of a large number of complex nucleosome configurations. Some 
authors artificially extend the reads in the 5 f -> 3 f direction or they 
shift the positions of the mapped read position of the forward and 
reverse toward the middle of potential nucleosomes. Then, they 
combine (e.g. sum) the forward and reverse modified coverages 
to build a score function. In both cases, they need to determine 
the amount of the extension or the size of the shift. In the former 
case, the extension should account for the expected length of the 
DNA fragments enriched for nucleosomes; in the latter, the shift 
should be about half of the DNA fragment size. The problem of this 
approach is that no extension or shift that will work equally well 
for all nucleosomes in the genome. While one should expect DNA 
fragments enriched for nucleosomes to be ~ 146 bp, the reality is that 
the digestion process can either leave nucleosome-free DNA in the 
sample, or 'over-digest' the ends of nucleosome-bound DNA. What 
complicate s the matter further is that the rate of d igestion is sequence 
dependent so nucleosomes in 

different genomic locations will end up with different DNA fragment 
size. For this reason, it is advantageous to 'learn' this information 
from the input data. TF dWeiner et allllOld) is the only method we 
know that can handle variable fragment sizes in a specified range, 
whereas other methods require users to decide this value in advance. 

As said, a variety of peak-calling algori t hms have b een also 
deve loped dAlbert et all 120071: iField et all I2008L l200St Kaplan 
et fl/. .l2009l;lMavrich et a/.Ll2008a. bUSasaki et all l2009F valouev 
et q/. j2008l) . Most of these methods have been proposed for the 
analysis of ChlP-chip or ChlP-seq data to determine the position and 
strength of the transcription factors binding to DNA. The problem of 
detecting transcription factor binding sites is similar to nucleosome 
positioning; in both cases we need to infer position of proteins 
binding to DNA from the coverage profiles. However, the size of the 
nucleosomes is significantly bigger than transcription factor binding 
sites, as a consequence the resulting configurations of nucleosomes 
can be more complex. 

To summarize our experience with existing methods on the 
geno me-wide nucleosome study of human malaria parasite (T onts 
et al J2010l l201ll) , peak calling approaches suffer from a variety 
of problems. First, the coverage profile function has to be cleaned 
of high- frequency nois e, typically via a kernel density estimation 
method dParzenl Il962h . The type of kernel and the amount of 
smoothing can drastically affect the results: too much can merge 
adjacent peaks, too little can leave too many noisy artifacts that can 
be interpreted as individual peaks. Second, peak finding algorithms 
have parameters (like the extension and the shift discussed above) 
that are difficult to optimize: a set of parameter can work for a 
region of a chromosome but not for another. Third, peak calling 
do not properly reso lve overlapping nucleosomes. For instance, TF 
dWeiner gfa/lbOlCh uses a greedy strategy: nucleosomes are placed 
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according to the 'best' matching peaks in the score function. Once 
these strong-positioned nucleosome are assigned, TF ignores any 
nucleosome that overlaps with previous ones. It is relatively easy to 
show that for overlapping nucleosomes the greedy strategy does not 
always return the best overall placement (see Section |3]for details). 

2 METHODS 

Next, we propose a parametric probabilistic model to find the most likely set 
of nucleosome that best 'explain' the mapped reads. We cast this problem in 
a modified Gaussian mixture model framework. The problem of positioning 
nucleosomes is then reduced to the problem of learning the parameters of the 
model and finding the distribution of mixture components, which is achieved 
via EM. 




Fig. 2. The proposed graphical mixture model: shaded nodes correspond to 
observed variables, white nodes correspond to hidden variables 



2.1 A probabilistic model for nucleosomes 

We employ a probabilistic model for nucleosome positioning that is described 
by a set of hidden and observed variables. We use N to denote the number 
of DNA fragments obtained after MNase digestion. For any DNA fragment 
ie [l,N], let jc* be the starting position of the 5'-end of fragment i (obtained 
by mapping a corresponding sequenced read), and let variable d{ e {+1, — 1} 
be the strand on which fragment i was mapped (+1 for the positive strand, 
and — 1 for the negative strand). Also, let Zi be the length of fragment i. If we 
use variable mi to denote the position of the center of the fragment i, then 
we have m/ =x; + 0fc)/2. 

We denote with Xi,Di,Zi and Mi the random variables associated with 
variables Xi,di,Zi and ra z , respectively. Since the sequencing process is 5 / -> 
y , the value of Xi is observable by means of mapping a read originating from 
fragment i. Similarly, the strand variable Di is also observable. Variables Z z 
and Mi can be observed directly only if sequencing produces paired-end 
reads, otherwise these variables are hidden. In order to consider the most 
general case, we only deal with the latter case (single-end reads). 

We assume for the time being that the number K of nucleosomes is 
given. We will discuss how to choose K in Section [23] For each DNA 
fragment i, we use a hidden variable Qe[l,K] representing the nucleosome 
to which it belongs. Each nucleosome je[l,K] is described by a set of six 
variables (/x/,ct/, A/, , ST 1 , ttj), where /zy denotes the center position of the 
nucleosome j, crj is the fuzziness associated with the position of nucleosome 
j, Aj describes the length of DNA fragments associated with nucleosome 
j, Sj~ l and Sr 1 represents the variation on fragment sizes for positive and 
negative strands, respectively, and ttj is the probability of nucleosome j. 
The degree of fuzziness captures the variation of the position of a particular 
nucleosome in the population of sampled cells. Well-positioned nucleosomes 
have very low degree of fuzziness. We introduce two variables Sj~ l and 8J l to 
model the variation of the fragment size because MNase does not only digest 
nucleo some-free DNA. Given enough time, it can also digest into the ends of 
the fragments bounds to nucleosomes, the rate of digest i on bei ng sequence 
dependent [see, e.g. jAllan et all l2012t Iweiner et a/1 1201 Ol) l. Since the 
sequence composition of the 5 f end of a DNA fragment can be quite different 
from the 3' end, we need to have two different variables. The value of Q 
is drawn from (1 , 2, . . . , K) with corresponding probabilities (jt\ , 712 , . . . , ttk). 
Parameter ttj models the contribution of j'-th nucleosome to the occupancy 
level, i.e. what portion of the mapped reads belong to nucleosome j. 

Our nucleosome model assumes that our random variables are distributed 
according to a normal distribution. For convenience of notation, we set 
® j = (lJLj,(jj,Aj,8f 1 ,8~r\7r j ) for all je[l,K], and 0 = (0i,...,0^). First, 
we assume that variable Mi associated with the center of the fragment i for 
a particular nucleosome j is distributed as follows 



P(Mi\Ci=j,®)~N(jLj,<jf) 



(1) 



nucleosome j is distributed as follows 

P(Zi\Di=di,Ci =j, ®)~N(Aj, OS/' ) 2 ) (2) 
where Aj represents the expected size of the fragments for nucleosome j, 
and Sf 1 and ST 1 represents the variation of fragment sizes for positive and 
negative strands, respectively. 

Combining Equations {TJ and |2j and relation Xi=mi — (dm)/ 2, and then 
applying the rule of linear combination of independent Gaussians we obtain 

P(Xi \Di=di,Q=j, ®)~N(fjLj-(diAj)/2, af+i&fl2) 2 ) 0) 
Equation |3) allows one to compute the probability of a given data point 
Xi given the parameters of a nucleosome. Next, we describe the model for 
multiple nucleosomes. 

2.2 Mixture model 

Next, we introduce a generative mixture model to describe the likelihood of 
input data points X = (x\ , . . . ,xn). Figure [2] shows a graphical representation 
of the mixture model. In Equation {3), the only hidden random variable is C 
because we already excluded variables Zi from the computation. By grouping 
variables Xi,Di, we can use an approach similar to a naive Bayes classifier. 
Variable C represents the nucleosome to which the points belong. Thus, we 
can describe the likelihood of point (jc/ , d/) given the parameters of our model 
as a mixture of distributions. Using the Bayesian rule we obtain 

K 

P(Xi \Di=di,@) = £>(Ci =j, Q)P(Xi \Di =di, Q =j, 0) 

7=1 
K 

= Y J ^jf^i^j-diA j /2,af+(8f/2f) (4) 

7=1 

wheref(x,a,b) = l/V2jrbe~( x ~ a ^ 2 / 2b is the Gaussian density function. 

Using Equation j4), we can obtain the log likelihood of observed data 
points X given parameters 0 as 

N 

l(X 1 0) =jy 0 gP(Xi = Xi | A = di , 0) 



J>(C f =7\ ®)P(Xi\Di=di, Q=j, 0) 
J= l 

' K 

J2 *jf (*i , ^ - di Aj/2, of + (<5 J (J) /2) 2 ) 

7=1 



(5) 



where \ij represents the center of the nucleosome j and oj is its fuzziness. 
Second, we assume that the length Z; of fragment i for a particular 



Given Equation J5j and the input data points X = (x\ , . . . ,jcjv), we can find 
an estimate the parameters of the model 0 via maximum likelihood 

0 = argmax 0 /(X|0) (6) 

Recall that 0 = (0i,...,0#) is a vector whose components are the 



nucleosome parameters 0/ = (/jlj , Aj , Oj , 8j + l \Sj 



-l) 



Ttj) for all je[l,K]. 
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Algorithm 1 : NOrMAL algorithm 

1 : {Parameter initialization} 

2: hq (/ii , fi 2 , • • • , i*> k) , where in is uniformly distributed 

3: .o^(^,.4)ER K 

4: 9 (t) <- (^^Ao,^ 1 ,^ 1 ,^) 

5: {Soft Learning} 

6: 6 ^ Learn(B (t) ) 

7: {Hard learning} 

8: for all % £ [l,iV] do 
9: d <— argmaxj(Tij) 

10: end for 

1 1 : Recompute cluster parameters 0 

12: return 0 

Algorithm 2: Learn(S) 

1: repeat 

2: while not converged do 

3: Q(0|0W)^£ c|X)0(t) Z(X|0) 

4: 0 (t+1) <- argmax Q Q(e\e {t) ) 
5: *<-J + l 
6: end while 

7: for all j G [1,K- 1] do 

8: if l/xj — /ij+il < threshold then 

9: {Merge clusters z and z + 1} 

10: end if 
1 1 : end for 

12: until no clusters were merged 

13: return 0 (t) 

Fig. 3. A sketch of the proposed NOrMAL algorithm 

The presence of parameters ttj that correspond to hidden variables C z prevents 
us from solving Equation J6) directly. We estimate 0 via EM. In our case, 
the E step requires computing the posterior probabilities P(C Z = j\Xi =xf, 0) 
of data points Xi,ie[l,N] with respect to the distribution of Q given the 
current estimate of parameters 0^ 

Q(®\®W)=E C]x@(t) l(X\@) (7) 

During the E step, we supplement the missing data in Equation |6) with the 
expected values under the current parameter estimates 0 W . In the M step, 
we find new parameter estimation by maximizing Equation Q 

0(*+i) = argmax 0 Q(0|0 (O ) (8) 

It is relatively straightforward to bound the variation parameters 
(<y,8 +l ,8~ l ) during the iterative EM process to converge to a solution with 
'reasonable' parameters. We can also easily introduce prior distribution for 
some of the parameters. For instance, we can specify an expected distribution 
for DNA fragment sizes Ay, which can be estimated via gel electrophoresis 
prior to sequencing. 

2.3 Choosing the number of nucleosomes 

The method described above assumes that the number of clusters K is known. 
The problem of selecting the best value for K is as challenging as selecting 
the optimal number of clusters in k-m&ms clustering. One can estimate the 
number of clusters by looking at the support area of the occupancy coverage, 
but this will be quite inaccurate because 'fuzzy' nucleosomes correspond to 
wider peaks, and the support area is bigger for them. 

Here, we propose a simple but effective heuristic to find K. We 
start by (1) placing the maximum possible number of non-overlapping 
nucleosomes uniformly distributed on the chromosome, that is K = (size of 
the chromosome)/(expected size of a nucleosome), where the expected size 



of nucleosomes is underestimated. Then, (2) we run our EM algorithm until 
convergence ('soft learning'). We will then (3) check the distance between the 
clusters, and merge those that have too much overlap (above a user- specified 
threshold). In case of multiple overlaps for a nucleosome, we merge it with 
the closest one. We repeat (2) and (3) until no additional clusters are merged. 
After a few cycles, we will obtain a set of non-overlapping clusters that best 
explain the given data points. Overlapping nucleosomes are merged into new 
ones and then the position of new nucleosomes are learned from the data. 

This procedure will give us a good estimate on the number of clusters as 
well as a rough estimate of the nucleosome positions. To further improve 
accuracy for other model parameters, we perform one iteration of 'hard 
learning' by assigning each data point Xi its maximum probable cluster. 
Nucleosome clusters will partition the set of input points, which in turns will 
allow us to compute their parameters more accurately. The pseudo-code of the 
algorithm is shown on FigureO The running time of NOrMAL is dominated 
by the running time of Learning step (Algorithm 1, line 6). Observe that the 
probability that a data point belongs to far-away nucleosomes is close to zero, 
so one can avoid unnecessary computations by computing updates only for 
clusters in close vicinity of each point. 

The running time is dominated by the heuristic used to find the number of 
nucleosomes K. In order for the algorithm to scale to eukaryotic genomes, 
additional optimization steps will have to be implemented. For instance, 
during the early stage of soft learning (i.e. active cluster merging), the 
algorithm could be applied to small 'chunks' of chromosomes. Then, 
when the number of merges reduces substantially, the nucleosome maps 
for each chunk could be combined and algorithm would continue to the 
hard-learning step. 

2.4 Practical considerations 

Our method requires users to specify three parameters, namely the threshold 
for allowed overlap between adjacent nucleosomes, the prior A on 
nucleosome sizes and its weight X. 

The threshold for allowed overlap can significantly affect the output: the 
more overlap is allowed, the more nucleosomes can be placed. In the current 
implementation, this parameter has to be specified by the user. The prior A on 
the nucleosome size and its weight X control the propagation of 'knowledge' 
from data points on forward strand to data points on the reverse strand, and 
vice versa. Based on our experience, if the prior size A is within 30 bp of 
the 'true' fragment size, then the algorithm is consistent in its output. 

Our implementation has some additional internal parameters that we 
are not expecting users to change. While inferring the parameters of our 
mixture models, some clusters will tend to cover most of the data points 
using large variances: a common trick to avoid this from happening is to 
introduce hard limits on such parameters. Our implementation has range 
limits for the nucleosome sizes and variance to force the method to converge 
to 'reasonable' nucleosome sizes/variances in the early stage of the iterative 
process. These parameters have been chosen loose enough so by the end 
of the iterative process, the limits for nucleosome size and variance are 
rarely hit, and the output is not significantly affected. The hard-learning step 
completely ignores those upper limits. 

Additional details on parameter selection will be found in the on-line User 
Manual. 

3 EXPERIMENTAL RESULTS 

We carried out extensive benchmarking betwe en our proposed 
method NOrMAL and TF dWeiner et all 1201 Cb . We selected TF 
because it is considered the current state-of-the-art. It is the only 
method that, in addition to inferring nucleosome positions, can 
extract nucleosome fragment sizes and binding scores. TF differs 
from the traditional peak-calling algorithms because it does not 
look for peaks in the coverage profiles, but it places nucleosomes at 
the peaks of a correlation score matrix. Due to its greedy strategy, 
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Table 1. The parameters used to generate the reads in Figure [4] and the 
corresponding output results from TF and NOrMAL 



Parameter 


True value 


TF 


NORMAL 


Ml 


210 


208 


206 


Ai 


130 


125 


134 


V>2 


300 


301 


300 


A 2 


150 


149 


148 



TF has significant limitations when dealing with fuzzy/overlapping 
nucleosomes, as explained next. 

The setup for the comparison is as follows. The input parameters 
for NOrMAL are the prior size of the nucleosome fragments and the 
allowed amount of overlap between nucleosomes. For TF, we used 
default parameters unless specified otherwise. The default allowed 
range of nucleosome size for TF is [100,200], which centered around 
the expected nucleosomes size of ~ 146 bp. 

Synthetic data: first, we want to illustrate the challenge for existing 
nucleosome positioning methods to deal with the placement of 
overlapping nucleosomes. The problem derives from the difficulty 
in distinguishing two overlapping nucleosomes from the 'fuzzy' 
case. To define precisely this problem, we need to introduce a 
threshold parameter: if the percentage of overlap between two 
nucleosomes exceeds the threshold, then they should be considered 
'fuzzy', otherwise they should be treated as separate overlapping 
nucleosomes. It is relatively easy to show that the greedy strategy 
does not always give the optimal nucleosome placement in 
case of overlapping nucleosomes. To do so, we have created a 
small synthetic dataset that contains reads corresponding to two 
overlapping nucleosomes. Although we could have used our own 
parametric model to generate the synthetic data, to avoid the 
possibility of giving an advantage to our method, we generated 
the input data accor ding to the template function described in 
dWeiner et al[ l2010h . The parameters for nucleosome positions 
(/xi and H2) and nucleosome sizes (Ai and A2) that we used to 
generate the reads are reported in Table Q] Figure |4}\ illustrates 
the coverage profile for mapped reads. Observe that there are two 
peaks on forward and reverse strands, which indicates the presence 
of two nucleosomes. The percentage of overlap is roughly 35%. 
In the first case, we allowed such amount of overlap in both TF 
and NORMAL (Fig. gfc for TF and Fig. for NOrMAL). Both 
methods correctly reported two overlapping nucleosomes. Observe 
the error bars attached to the boundaries of nucleosomes reported by 
NOrMAL, which indicate the positional variance (or 'fuzziness') of 

the corresponding boundary (each bar has length 3 J of + <$• 1 ). TF 
does not provide such information. 

In the second case, when the parameters are set so nucleosomes 
are not allowed to overlap >30%, only one nucleosome should 
be reported. Figure 0J) and E illustrates the output of TF and 
NOrMAL, respectively. Observe that now there is a fundamental 
difference: TFs greedy strategy reports the presence of the first 
nucleosome, but then it completely ignores the data corresponding 
to the second nucleosome. This is an entirely arbitrarily choice and 
the user will not be even aware of this. In contrast, NOrMAL reports 
one nucleosome positioned near the centroid of the data points and 
correctly indicates that the variance of the nucleosome boundaries 
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Fig. 4. An example of two overlapping nucleosomes. (A) Coverage profile 
from synthetic data (forward strand on top, reverse strand on bottom), where 
nucleosomes overlap on ~35% of their length; nucleosomes detected using 
TF allowing maximum 35% overlap (B), NOrMAL allowing maximum 35% 
overlap (C), TF allowing only 30% overlap, [100,200] (D) and NOrMAL 
allowing only 30% overlap (E); nucleosome reported by TF with nucleosome 
size range [40,200] (F) and [100,300] (G) 

in this case is very high, indicating that this nucleosome should be 
considered 'fuzzy'. 

In addition, TFs positioning results are very sensitive to its 
main input parameter, namely the allowed range for nucleosome 
fragment sizes. With the default size range [100,200], TF reports 
one nucleosome (Fig.[4j)). When we extend the range to [40,200], 
TF detects one small nucleosome by incorrectly matching the two 
strongest (but closest) peaks (Fig. |4f ). If we change the range to 
[100,300], TF reports one large nucleosome, this time matching the 
outmost peaks (Fig.[4jj). Even if we allow a larger overlap, TF will 
still produce the nucleosome in Figure Hfj (data not shown). Since 
the allowed range for nucleosome size in TF is a hard boundary, 
the results are very dependent from the choice of this parameter. 
NOrMAL is more robust in that regard because the prior distribution 
for the fragment sizes in NOrMAL is 'soft' and it can adapt to 
the data. 

Real data: the challenge for nucleosome position inference is that 
the true positions of the nucleosomes are unknown. The lack of 
a 'ground- truth' makes it very hard to benchmark the existing 
computational methods. To compare between methods, we can only 
use conservative indicators. We argue that a valuable indicator is 
the number of reported nucleosomes. Nonetheless, it is difficult to 



i246 



NORMAL: Accurate Positioning of Nucleosomes 



Table 2. Experimental results on the S. cerevisiae dataset: number of 
nucleosome detected by TF and NOrMAL and corresponding execution 
time (bold numbers indicate the maximum) 



Chromosome 


No of mapped 


TF 


Time 


NORMAL 


Time 




reads 




(s) 




(s) 


1 


16 688 


1033 


1.38 


1078 


6.86 


2 


78 543 


4284 


7.37 


4394 


84.56 


3 


30 589 


1583 


4.43 


1618 


8.49 


4 


138 801 


7 975 


16.36 


8 014 


369.11 


5 


55 601 


2986 


4.02 


3101 


38.80 


6 


26 141 


1403 


1.63 


1453 


4.45 


7 


101 981 


5 727 


9.84 


5 817 


126.34 



argue about performance in objective terms. That is why t he firs t 
dataset we considered is from S. cerevisiae dWeiner et aZLbOlOh . 
This was the original dataset for which TF was designed. The results 
of TF on this dataset are assumed to be accurate. 

To compare NOrMAL and TF, we used the following setup. The 
main range parameter for TF was set to [80,200], which is slightly 
wider than the default parameters. We did not want to penalize TF, 
since NOrMAL does not have any hard limits for the nucleosome 
sizes. For NOrMAL, the main parameter is the prior expected value 
for the nucleosome sizes: we used 140 bp to hit the middle of the 
specified range of TF. The threshold value of allowed overlap for 
both methods was set at 35%. All other parameters were left to 
default values. The results of both methods are reported in Table [2] 
Observe that NOrMAL is slower, but it returns on average 2.6% 
more nucleosomes than TF. 

If we compare the distribution of reported nucleosome sizes 
(Fig. 13, both methods provide consistent results. To compare how 
reported nucleosomes are related to each other, we performed a 
matching procedure. We built a bipartite graph, where a node 
corresponds to a reported nucleosome (each part corresponds to one 
of the two methods). The bipartite graph is fully connected, and the 
weight on edge (u,v) is the squared distance between nucleosome 
u and v: when the distance between u and v exceeded 50 bp, we set 
the weight to oo. Then we solved the weighted assignment problem 
between the two sets using the Hungarian method. The distribution 
of pairwise distances between matching nucleosomes is shown in 
Figure |6] Observe that the distribution is a unimodal bell- shaped 
curve with mean and mode having near zero value. The number of 
matched (common) nucleosomes is 81.44% of the total: 8.29% and 
10.27% are unique to NOrMAL and TF, respectively. 

While the dataset for S. cere visiae is conside r ed to h ave relatively 
stable set of nucleosomes dWeiner et a/1 l2010h . the dataset 
for the huma n malaria parasite P. falciparum has very dynamic 
nucleosomes JPonts et flZlbOlCb . The considered dataset consists 
of seven time-points (namely, 0, 6, 12, 18, 24, 3 0 and 36 h), 
each related to a different stage on the cell cycle feoch et~al[ 
l2003h . The experiment assumes that cells are 'synchronized' at 
each time-point, but the synchronization is not perfect due to 
experimental limitations. As a consequence, we expect a large 
number to nucleosomes to exhibit a 'fuzzy' behavior. 

First, we performed nucleosome placement with the same setup 
as with the yeast dataset. The distribution of fragment sizes is quite 
different in this case (Fig. [8j. NOrMAL reports fragment sizes 
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Fig. 5. Size distribution of reported nucleosomes for S. cerevisiae 
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Fig. 6. Distribution of pairwise distances between corresponding 
nucleosomes reported by TF and NOrMAL for S. cerevisiae 
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Fig. 7. Distribution of distances between corresponding nucleosomes 
reported by TF (range [80,200]) and NOrMAL for P. falciparum 
(chromosome 1) across all seven time-points 

with a mean value of 105 bp and mode value of about 120 bp, 
whereas TF reports a distribution with mean and mode of about 
84 bp. If we perform the matching of the reported nucleosomes, the 
distribution of distances is much wider than for yeast (Fig. [TJ. Now 
only 50% of all detected nucleosomes are in common between two 
methods. A total of 32.38 and 17.62% are unique to NOrMAL and 
TF, respectively. As expected, the disagreement between NOrMAL 
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Fig. 8. Size distribution of reported nucleosomes for P. falciparum 
(chromosome 1) across all seven time-points 

and TF is much higher on this dataset, due to presence of a much 
higher fraction of overlapping/fuzzy nucleosomes. 

Two examples of the disagreement between TF and NOrMAL 
are shown in Figure |9] Forward and reverse coverage profiles with 
extension to 35 bp are shown on top. Nucleosomes are represented 
by ovals, where the height of each oval represents the confidence 
score. NOrMAL also reports the variance associated with the left 
and the right boundary, represented with error bars. In Figure [9] 
(top), we have labeled corresponding nucleosomes 1-6. Some 
observations are in order. First, nucleosome 3 is an incarnation 
of the synthetic example in Figure @J) and E. The coverage 
profile around coordinate 800 shows two heavily overlapping 
nucleosomes that should be reported as one 'fuzzy' nucleosome. 
However, TF reports the position of the nucleosome using the 
stronger pair of forward/reverse peaks and completely ignores 
the other pair of peaks. As a consequence, the coordinate of the 
reported nucleosome is shifted compared to the centroid of the 
four peaks. NOrMAL instead correctly places one nucleosome at 
the centroid with a relatively high 'fuzziness' score. Nucleosome 
3 is also quite fuzzy, and it is better placed by NOrMAL. Some 
disagreement exists on nucleosome 6 as well. The left boundary 
of that nucleosome detected by TF correspond to a very weak 
peak. This is due to the fact that TFs placement is based on 
the correlation score rather than the intensity of the peak. In 
fairness, both methods assign nucleosome 6 a very low confidence 
score. Finally, TF detects additional nucleosomes a and b, whereas 
NOrMAL reports additional nucleosome c. All these nucleosomes 
have low confidence scores. Our method did not report a and b 
because it explained the data using fuzzy nucleosome 2. NOrMAL 
should have merged nucleosome c to nucleosome 1, but because the 
overlap did not exceed the chosen threshold those two nucleosomes 
were not merged. 

Figure [9] (bottom) illustrates a more complex example of the 
coverage profile: even for trained experts placing nucleosomes here 
would be very challenging. The output of NOrMAL and TF are 
quite consistent for nucleosomes associated to strong peaks. In the 
regions with high density of peaks, TF tends to place nucleosomes 
of small sizes (see also Fig. [10} and pack them as tight as possible 
according to allowed overlapping threshold. 
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Fig. 9. Two examples of nucleosome maps for chromosome 1 of 
P. falciparum (top: '0' is location 148 500 bp, bottom: '0' is location 
111500bp): forward and reverse coverage profiles are shown on top; 
nucleosomes are represented by ovals where the height of each nucleosome 
represents the confidence score 

In order to increase the agreement between TF and NOrMAL, 
we tried to extend the range of nucleosome sizes for TF to [20,250]. 
The new fragment size distributions are shown on Figure [Tol 
Observe that by comparing Figures [8l and [Tol the size distribution 
for NOrMAL are the same (only truncated in Fig. [8}, while the 
distribution for TF has changed completely, again pointing out how 
this range parameter can drastically change the results. Using the 
extended range, TF was allowed to place smaller nucleosomes so 
the mode and the mean of the size distribution shifted to smaller 
values. According to the authors of TF, such small nucleosomes 
can be due to problems in the experimental procedure, namely 
overexposing the sample to the MNase digestion. However, the gel 
electrophoresis analysis shows that the expected size of sequenced 
fragments in our samples after digestion was about 130 bp in 
length (without adapters). We speculate that TFs approach might 
have a problem when the range of admissible nucleosome sizes is 
too wide, and the algorithm confuses boundaries of neighboring 
nucleosomes. 
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Template Filtering, range [20,250] 
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Fig. 10. Size distribution of reported nucleosomes for P. falciparum 
(chromosome 1) across all seven time-points 



Table 3. Number of nucleosomes reported for P falciparum (chromosome 1 ) 
for different time-points (TF, parameter in square brackets is the range of 
admissible nucleosome sizes, bold numbers indicate the maximum) 



Time (h) 


TF [80-200] 


TF [20-250] 


NORMAL 


0 


1720 


2031 


1934 


6 


1491 


1826 


1720 


12 


1461 


2158 


2043 


18 


1 185 


1665 


1537 


24 


1440 


1910 


1766 


30 


1723 


2229 


2443 


36 


1701 


2514 


2788 



The number of reported nucleosomes for all time-points is shown 
in Table |3] Observe that for time-points 0-24 h, the number of 
nucleosomes reported by NOrMAL is between TF with range 
[80,200] and [20,250]. However, recall that the extended range 
[20,250] is likely to be unreliable. For time-point 30 and 36 h, 
NOrMAL identifies a higher number of nucleosomes. The 30 and 
36 h marks correspond t o the schizont stage of the P. falciparum life 
cycle feoch et a/.Ll2003l) . During this stage, the parasites divide, the 
chromatin compacts and a large number of nucleosomes are added. 

4 CONCLUSION 

We described a parametric probabilistic model for nucleosomes 
positioning framed in the context on a modified Gaussian mixture 
model. Our method directly addresses the challenges imposed 
by overlapping and fuzzy nucleosomes, their detection and 
the inference of their characteristics. We demonstrated with a 



synthetic example that the current state-of-the-art method does not 
properly handle complex overlapping configurations. We have also 
shown that NOrMAL is significantly more robust to user-defined 
parameters. On real data, our method detects a higher number of 
nucleosomes. Although our method is currently slower than TF, 
the processing time is still modest compared to other steps in the 
sequencing pipeline and we believe the efficiency of NOrMAL can 
be improved. 
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